Differentiating nontuberculous mycobacterium pulmonary disease from pulmonary tuberculosis through the analysis of the cavity features in CT images using radiomics

Objective To differentiate nontuberculous mycobacteria (NTM) pulmonary disease from pulmonary tuberculosis (PTB) by analyzing the CT radiomics features of their cavities.
Methods A total of 73 patients with NTM pulmonary disease and 69 patients with PTB, all with cavities, from Shandong Province Chest Hospital and Qilu Hospital of Shandong University were retrospectively analyzed. A further 20 patients with NTM pulmonary disease and 20 patients with PTB, all with cavities, from Jinan Infectious Disease Hospital were collected for external validation of the model. In total, 379 cavities were delineated as regions of interest (ROIs) on chest CT images by 2 experienced radiologists. Using a computer-generated random number, 80% of the cavities were allocated to the training set and 20% to the validation set. A total of 1409 radiomics features extracted with the Huiying Radcloud platform were used to analyze the CT cavity characteristics of the two diseases. Feature selection was performed using analysis of variance (ANOVA) and least absolute shrinkage and selection operator (LASSO) methods, and six supervised learning classifiers (KNN, SVM, XGBoost, RF, LR, and DT models) were used to analyze the features.
Results Twenty-nine optimal features were selected by the variance threshold method, the K best method, and the LASSO algorithm, and the ROC curves were obtained. In the training set, the AUC values of the six models were all greater than 0.97, the 95% CIs were 0.95–1.00, the sensitivity was greater than 0.92, and the specificity was greater than 0.92. In the validation set, the AUC values of the six models were all greater than 0.84, the 95% CIs were 0.76–1.00, the sensitivity was greater than 0.79, and the specificity was greater than 0.79. In the external validation set, the AUC values of the six models were all greater than 0.84, and the LR classifier had the highest precision, recall and F1-score, which were 0.92, 0.94 and 0.93, respectively.
Conclusion The radiomics features extracted from cavities on CT images can provide effective evidence for distinguishing NTM pulmonary disease from PTB, and the radiomics analysis gives a more accurate diagnosis than the radiologists. Among the six classifiers, the LR classifier performs best in identifying the two diseases.
Supplementary Information The online version contains supplementary material available at 10.1186/s12890-021-01766-2.


Introduction
Radiomics is an emerging field that aims at building a relevant statistical model from a large number of high-dimensional mineable features extracted from medical imaging data (possibly combined with clinical or genomic data) to assist diagnosis, prognosis and therapy monitoring.
The radiomics workflow involves imaging, ROI segmentation, and feature extraction and analysis. With the selected features, a statistical model is then built using machine learning algorithms, which have to be tuned according to the clinical or biological question and to the available a priori knowledge.

Patients and Dataset
A total of 175 patients (114 men and 61 women; mean age, 51 years ± 17.44; range, 17-90 years) were included in this study. We used Radcloud (Huiying Medical Technology Co., Ltd) to manage the imaging data and clinical data and to perform the subsequent radiomics statistical analysis.
The training dataset and validation dataset were separated at random with a ratio of 8:2 (training:validation), and the random seed was 687.
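The random split can be sketched as follows. This is a minimal illustration and not the Radcloud implementation: the function name `split_cavities` and the use of Python's `random` module are assumptions made here. Only the 80%/20% allocation, the seed 687 and the count of 379 cavities come from the text.

```python
import random

def split_cavities(cavity_ids, train_fraction=0.8, seed=687):
    """Shuffle the cavity IDs reproducibly and split them into training and validation lists."""
    ids = list(cavity_ids)
    rng = random.Random(seed)      # dedicated generator, so the global random state is untouched
    rng.shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

# Hypothetical IDs 0..378 standing in for the 379 cavities
train_ids, val_ids = split_cavities(range(379))
print(len(train_ids), len(val_ids))   # 303 76
```

Fixing the seed makes the allocation repeatable: calling the function again with the same IDs returns the same two lists.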

Feature extraction
A total of 1409 quantitative imaging features were extracted from CT images with the Radcloud platform (http://radcloud.cn/). These features can be divided into three groups. Group 1 (first order statistics) consisted of 126 descriptors that quantitatively delineate the distribution of voxel intensities within the CT image through commonly used and basic metrics. Group 2 (shape- and size-based features) contained 14 three-dimensional features that reflect the shape and size of the region. Group 3 (texture features) comprised 525 textural features, calculated from grey level run-length and grey level co-occurrence texture matrices, that quantify differences in region heterogeneity.
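To make the first group concrete, the sketch below computes a handful of first-order statistics from the voxel intensities of one ROI. It is illustrative only: the function name `first_order_features`, the choice of these five descriptors, the bin width of 25 and the sample intensity values are all assumptions made here, and the 126 descriptors computed by Radcloud are not specified in the text.

```python
import math
from collections import Counter

def first_order_features(voxels, bin_width=25):
    """Compute a few first-order statistics for the voxel intensities in one ROI."""
    n = len(voxels)
    mean = sum(voxels) / n
    variance = sum((v - mean) ** 2 for v in voxels) / n   # population variance
    std = math.sqrt(variance)
    # Standardized third and fourth moments; defined as 0 for a constant ROI
    skewness = sum((v - mean) ** 3 for v in voxels) / n / std ** 3 if std else 0.0
    kurtosis = sum((v - mean) ** 4 for v in voxels) / n / std ** 4 if std else 0.0
    # Shannon entropy of the intensity histogram built with fixed-width bins
    counts = Counter(math.floor(v / bin_width) for v in voxels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"mean": mean, "variance": variance, "skewness": skewness,
            "kurtosis": kurtosis, "entropy": entropy}

# Hypothetical intensities (in HU) from a cavity ROI: air-like lumen plus soft-tissue wall
roi = [-950, -920, -900, -60, 20, 35, 40, 55]
print(first_order_features(roi))
```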

Feature qualification
As described above, a large number of image features may be computed. However, not all of the extracted features are useful for a particular task. Therefore, dimensionality reduction and selection of task-specific features are necessary steps for best performance. To reduce redundant features, three feature selection methods were used: the variance threshold, SelectKBest, and the least absolute shrinkage and selection operator (LASSO). For the variance threshold method, the threshold was 0.8, so that features with a variance smaller than 0.8 were removed. The SelectKBest method, a univariate feature selection method, uses the p value to analyze the relationship between each feature and the classification result; all features with a p value smaller than 0.05 were retained. For the LASSO model, an L1 regularizer was used in the cost function, the cross-validation parameter was set to 5, and the maximum number of iterations was 1000.
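The three-stage selection can be sketched with scikit-learn as below. This is a sketch under stated assumptions, not the Radcloud code, whose internals the text does not describe: scikit-learn is a stand-in, the p value is taken from the ANOVA F-test (`f_classif`), the value 5 is interpreted as the number of cross-validation folds, the standardization before LASSO is added here, and the helper `select_features` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, f_classif
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def select_features(X, y, var_threshold=0.8, p_cutoff=0.05, cv=5, max_iter=1000):
    """Return the column indices that survive the three successive filters."""
    idx = np.arange(X.shape[1])

    # Step 1: drop features whose variance does not exceed the threshold
    keep = VarianceThreshold(threshold=var_threshold).fit(X).get_support()
    X, idx = X[:, keep], idx[keep]

    # Step 2: univariate ANOVA F-test, keep features with p < p_cutoff
    _, p_values = f_classif(X, y)
    keep = p_values < p_cutoff
    X, idx = X[:, keep], idx[keep]

    # Step 3: L1-penalized regression (assumed 5-fold CV); non-zero coefficients remain
    X_std = StandardScaler().fit_transform(X)   # assumption: features are standardized first
    lasso = LassoCV(cv=cv, max_iter=max_iter).fit(X_std, y)
    return idx[lasso.coef_ != 0]

# Synthetic data: 200 samples, 30 features, labels driven by features 0 and 1
rng = np.random.default_rng(0)
X = rng.normal(scale=2.0, size=(200, 30))   # variance of about 4 per feature
X[:, 2] *= 0.05                             # variance of about 0.01, removed in step 1
y = (X[:, 0] + X[:, 1] > 0).astype(int)

selected = select_features(X, y)
print("selected feature indices:", selected)
```

Each stage only narrows the set passed on by the previous one, so the order matters: the cheap variance and univariate filters run first, and the cross-validated LASSO fit works on the features that remain.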

Statistical analysis
The statistical analysis was performed on the Radcloud platform. After feature qualification, a total of 29 features were identified as significantly correlated with the classification task. Based on the selected features, several supervised learning classifiers are available for classification analysis; these create models that attempt to separate or predict the data with respect to an outcome or phenotype (for instance, patient outcome or response). In this study, the radiomics-based models were constructed with 6 classifiers: k-Nearest Neighbor (KNN), Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), Random Forest (RF), Logistic Regression (LR) and Decision Tree (DT). A validation method was used to improve the effectiveness of the model.
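A comparison of several classifiers on a held-out validation set might look like the sketch below. It rests on several assumptions: scikit-learn stands in for the Radcloud platform, the data are synthetic (only the sample count of 379 and the feature count of 29 echo the abstract), all hyperparameters are library defaults rather than the study's settings, and scikit-learn's `GradientBoostingClassifier` is substituted for XGBoost so that the example needs no extra package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the study's radiomics features
X, y = make_classification(n_samples=379, n_features=29, n_informative=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=687, stratify=y)

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "GB (stand-in for XGBoost)": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "DT": DecisionTreeClassifier(random_state=0),
}

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_val)[:, 1]   # predicted probability of the positive class
    auc[name] = roc_auc_score(y_val, scores)
    print(f"{name}: validation AUC = {auc[name]:.2f}")
```

The distance- and margin-based models (KNN, SVM, LR) are wrapped in a scaling pipeline because they are sensitive to feature scale, whereas the tree-based models are not.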
Tab. 1 Description of the selected radiomic features with their associated feature group and filter